Speculative Throughput Computing

ABSTRACT

Systems, methods, and apparatuses including computer program products for speculative throughput computing are disclosed. Speculative throughput computing is used to translate a program to execute on a plurality of processors, processor cores, or threads.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application Ser. No. 60/897,969, for “System, methods, and business ideas for speculative execution of program segments in multiprocessors,” filed on Jan. 30, 2007, which provisional patent application is incorporated by reference herein in its entirety.

TECHNICAL FIELD

This subject matter generally relates to throughput computing.

BACKGROUND

A program (e.g., computer application) can be partitioned into a plurality of program segments. For example, a program can be partitioned into program segments P₁, P₂, . . . , P_(N), where N is a number of program segments. Conventional computing systems can execute the program segments one after another in an enumerated order. For example, a single processor computing system can execute P₁ before executing P₂, P₂ before executing P₃, and P_(N-1) before executing P_(N). Executing program segments in this order respects sequential semantics (e.g., a program segment with a lower enumeration order y writes to a memory location before a program segment with a higher enumeration order x reads from the memory location, where x>y).

For example, a first program segment can have an enumeration order i, and a second program segment can have an enumeration order j, where i<j. The first program segment and the second program segment can be executed in parallel without violating sequential semantics if the program segments do not access the same memory locations. Furthermore, sequential semantics is not violated if the first program segment does not write to a memory location after the second program segment reads from the memory location.
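The condition above can be illustrated with a minimal sketch. The trace format (tuples of enumeration order, operation, and address, listed in the order the accesses actually occurred) and the function name are assumptions made for illustration only. The sketch checks only the read-before-write case described above.

```python
from typing import Dict, List, Tuple

# Hypothetical trace record: (enumeration order, "R" or "W", address),
# listed in the order the accesses actually happened.
Access = Tuple[int, str, int]

def violates_sequential_semantics(trace: List[Access]) -> bool:
    """Return True if a lower-order segment writes a location after a
    higher-order segment has already read that location."""
    highest_reader: Dict[int, int] = {}  # address -> highest order that read it
    for order, op, addr in trace:
        if op == "R":
            highest_reader[addr] = max(order, highest_reader.get(addr, order))
        elif op == "W" and highest_reader.get(addr, order) > order:
            return True
    return False
```

For example, a trace in which segment 2 reads address 100 before segment 1 writes address 100 is reported as a violation, while the same accesses in the opposite order are not.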

Multiprocessor, multicore, or multithreading computing systems can execute program segments in parallel (e.g., executing program segments at substantially the same time) on a plurality of processors, processor cores, or threads. Executing program segments in parallel that were not originally designed to execute in parallel can be referred to as “speculative execution.”

Conventional compilers can partition a program into program segments by determining which program segments access the same memory locations. Due to limitations of conventional analysis methods, or because accessed memory locations are unknown at a time of compiling, many programs cannot be partitioned by conventional compilers to allow for parallel execution of the program segments.

For example, some conventional analysis methods execute write instructions of program segments at temporary memory locations. These conventional analysis methods create execution overhead associated with using the temporary memory locations (e.g., storing and moving data from the temporary memory locations). Other conventional analysis methods use centralized data structures to store original data of memory locations where write instructions of program segments write, so that the original data may be restored. Updating the centralized structure can cause excessive overhead, especially if the write log is implemented in software using a dedicated data structure. Furthermore, if a miss-speculation occurs (e.g., when a program segment with an enumeration order i has written to a location that a program segment with an enumeration order j has already read, where i<j), program segments with an enumeration order higher (e.g., greater) than j halt and redo their executions. Halting and redoing executions causes execution overhead that can make speculative execution inefficient.

Furthermore, typical hardware and software implementations of conventional analysis methods use complex mechanisms and are inefficient. For example, typical software implementations create execution overhead because they use extra instructions, and cause poor memory system performance by lowering the memory locality, which can result in cache misses.

Some implementations monitor fixed regions of memory (e.g., a fixed range of one or more consecutive memory locations) to track if the region has been modified. A region size that is too large can result in false determinations of violations of sequential semantics. For example, a program segment may be forced to halt and redo its execution if it accesses the same region as another program segment with a lower enumeration order, even if none of the program segments accesses the same location. Alternatively, a region size that is too small increases the overhead of monitoring read and write instructions.

In addition, conventional profiling methods (e.g., test executions of programs to determine properties of a program to predict gains from speculative execution of the programs) assume a single method for speculative execution. Methods for speculative execution are also referred to as “speculative methods” or “processes for speculative throughput computing.” Furthermore, conventional dependence analyzers often are not able to determine whether the program segments may be executed in parallel.

SUMMARY

Systems, methods, and apparatuses including computer program products for speculative throughput computing are disclosed. Speculative throughput computing is used to translate a program to execute on a plurality of processors, processor cores, or threads.

Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. An advantage of speculative throughput computing is that it reduces execution overheads by using speculative read and write computer instructions.

An additional advantage of speculative throughput computing is that it reduces execution overhead related to commit operations by assuming that a speculation is valid, and by restoring content of speculatively modified memory locations using a decentralized scheme.

An additional advantage of speculative throughput computing is that it determines speculative parallelism by executing program segments out of sequential order.

An additional advantage of speculative throughput computing is that it increases the accuracy of determining violations of sequential semantics and reduces execution overhead by maintaining a precise range of locations that have been speculatively accessed.

An additional advantage of speculative throughput computing is that it increases speedup gains from speculative execution by selecting one of a plurality of speculative methods.

An additional advantage of speculative throughput computing is that by increasing speedup gains, speculative throughput computing lowers an operating frequency of a system and reduces energy consumption.

DESCRIPTION OF DRAWINGS

FIG. 1 illustrates an example multiprocessor computing system.

FIG. 2 shows an example process for translating a program to execute on a plurality of processors, processor cores, or threads.

FIG. 3 shows an example process for speculative throughput computing.

FIG. 4 illustrates structures for an example hardware implementation of the process for speculative throughput computing of FIG. 3.

FIG. 5 shows an example process for generating the structures of FIG. 4.

FIG. 6 shows an example process for a commit operation using the structures of FIG. 4.

FIG. 7 shows an example process for a roll-back operation using the structures of FIG. 4 and a back-track stack.

FIG. 8 illustrates an example back-track stack.

FIG. 9 illustrates structures for an example software implementation of the process for speculative throughput computing of FIG. 3.

FIG. 10 shows an example process for generating the structures of FIG. 9.

FIG. 11 shows an example process for a commit operation using the structures of FIG. 9.

FIG. 12 shows an example process for a roll-back operation using the structures of FIG. 9 and a back-track stack.

FIG. 13 illustrates an example data dependence graph.

DETAILED DESCRIPTION

Example Computing Systems

FIG. 1 illustrates an example multiprocessor computing system 100. The multiprocessor computing system 100 includes processors (e.g., processors 131, 132, and 133); cache memory coupled to the processors (e.g., private caches 121, 122, and 123); an interconnect 140; and a memory 110. The processors and the cache memory are coupled to the interconnect 140 (e.g., a bus, or a crossbar switch). The interconnect 140 allows cache memory to send a request for additional memory to memory 110 or other cache memory.

In some implementations, additional hierarchies of memory can be used in the multiprocessor computing system 100. For example, the memory 110 can be a secondary cache that is coupled to additional memory. In some implementations, the cache memory includes local memory that can be accessed by a processor coupled to the local memory. For example, a read or write instruction that is executed by the processor 131 can access local memory coupled to the processor 132 by invoking a software routine that sends a signal to the processor 132. The signal can invoke a software routine executed by the processor 132, where the processor 132 accesses the local memory and returns a value to the processor 131 by sending a signal to the processor 131.

Coherence (e.g., cache coherence) is maintained in the cache memory. In some implementations, a write-invalidate cache coherence mechanism can be used to maintain cache coherence. The write-invalidate cache coherence mechanism can invalidate a block of memory (e.g., contiguous locations of memory) in a cache memory when a processor coupled to another cache memory writes to the block of memory. In some implementations, a write-update cache coherence mechanism can be used to maintain cache coherence. In particular, a block of memory is updated when a processor coupled to another cache modifies the block of memory. In some implementations, a distribution protocol of invalidate and update requests can be one-to-all (e.g., snoopy cache protocols). In some implementations, the distribution protocol of invalidate and update requests can be one-to-one (e.g., directory-based protocols).

In some implementations, one or more of the processors can include a plurality of independent processor cores (e.g., a multi-core processor). For example, a dual-core processor includes two processor cores, and a quad-core processor includes four processor cores. The processor cores can execute a plurality of threads (e.g., program segments created by forks or splits of a program, or threads of execution) in parallel.

Program Translation Overview

FIG. 2 shows an example process 200 for translating a program to execute on a plurality of processors, processor cores, or threads. For example, a computer application written in a high-level programming language (e.g., C, C++, Fortran, or Java) can be translated into a machine language program that is executed on a processor. For convenience, the process 200 will be described with respect to a system that performs the process 200.

The system analyzes 210 data dependence of program segments of a program. In particular, the system determines whether program segments of the program, that can be sequentially executed (e.g., one after another) on a single processor computing system, can be executed in parallel (e.g., on a plurality of processors in a multiprocessor computing system, on a plurality of processor cores on a single processor, on a plurality of threads).

The system determines one or more methods to speculatively execute read and write instructions of program segments in parallel for use in the translation. In particular, the system profiles 220 the program. During a profiling pass, the system executes the program to collect statistics. Some examples of statistics include but are not limited to: a number of instructions executed in a program segment, an average number of cycles needed by each instruction, and a number of speculative reads and writes in a program segment. The system then selects 230 one or more speculative methods to use for translation. In some implementations, the system selects a combination of speculative methods.

Speculative Throughput Computing

FIG. 3 shows an example process 300 for speculative throughput computing. For convenience, the process 300 will be described with respect to a system that performs the process 300. In some implementations, the system generates 310 a precise range of locations that speculative read or write instructions have accessed. The range of locations can be precise because the highest location (e.g., a maximum data address in the range) has been accessed by a speculative read or write instruction. Furthermore, the lowest location (e.g., a minimum data address in the range) has been accessed by a speculative read or write instruction.

In some implementations, the system generates a precise range of locations that speculative read or write instructions have accessed for each processor in a multiprocessor, processor core in a multi-core processor, or thread. Other implementations are possible.

If a speculative write instruction has accessed a first precise range of locations corresponding to a first program segment or a second precise range of locations corresponding to a second program segment, the system can compare 320 the first precise range of locations with the second precise range of locations. In particular, the system can determine if a speculative execution of the program segments in parallel respects sequential semantics. For example, the first program segment and the second program segment can be executed in parallel. The first program segment can have an enumeration order less than an enumeration order of the second program segment. If the second program segment reads from a memory location before the first program segment writes to the memory location, then sequential semantics has been violated.

Sequential semantics can be violated if the ranges of locations overlap. For example, if the first precise range of locations overlaps with the second precise range of locations, then the first program segment and second program segment may have accessed the same memory location. If the first precise range of locations overlaps with the second precise range of locations and a location in the first precise range or the second precise range has been modified, the system identifies 330 a miss-speculation (e.g., a speculation that may not conform to sequential semantics).
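Steps 320 and 330 can be sketched as follows. This is a minimal illustration, assuming each precise range is represented as a tuple of a minimum data address, a maximum data address, and a flag indicating whether a speculative write has accessed the range. The function names are hypothetical, and the overlap test shown is the conventional closed-interval test rather than a test taken from the figures.

```python
from typing import Tuple

# Hypothetical representation: (minimum address, maximum address, written flag).
PreciseRange = Tuple[int, int, bool]

def ranges_overlap(first: PreciseRange, second: PreciseRange) -> bool:
    """Conventional closed-interval overlap test (an assumption of this sketch)."""
    return first[0] <= second[1] and second[0] <= first[1]

def is_miss_speculation(first: PreciseRange, second: PreciseRange) -> bool:
    """A miss-speculation is identified when the ranges overlap and at least
    one of the two ranges has been speculatively written."""
    either_written = first[2] or second[2]
    return either_written and ranges_overlap(first, second)
```

Because only the range end points are compared, the result is conservative: overlapping ranges indicate that the two program segments may have accessed the same location, not that they did.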

If a miss-speculation is identified, the system restores 340 the memory content of locations speculative write instructions have accessed. Implementations of the process 300 for speculative throughput computing will be described in further detail with respect to FIGS. 4-13.

Example Hardware Implementation for Speculative Throughput Computing

FIG. 4 illustrates structures 400 for an example hardware implementation of the process for speculative throughput computing of FIG. 3. The structures, in hardware, include a data structure 410 (e.g., an access matrix table), a write log 420, and a write log pointer 430. In some implementations, each processor is extended with the structures (e.g., data structure 410, write log 420, and write log pointer 430). In some implementations, each processor core in a processor or thread is extended with the structures.

In some implementations, the data structure 410 is a table with a number of entries. Each entry can include fields. Some examples of fields include but are not limited to: an instruction address field 411, a data validity field 412, a maximum data address field 413, a minimum data address field 414, and an indicator field 415. The instruction address field 411 (e.g., “TAG” field) can store an instruction address (e.g., a 64-bit address that identifies a location of an instruction in memory). The data validity field 412 (e.g., “V” field) can store, for example, a single bit that indicates whether the data stored in an entry is valid. The maximum data address field 413 (e.g., “MAX” field) and the minimum data address field 414 (e.g., “MIN” field) can store data addresses. In particular, a maximum data address is stored in the maximum data address field 413 and a minimum data address is stored in the minimum data address field 414. The maximum data address field 413 and minimum data address field 414 can be used to define a precise range of locations, as described above with reference to FIG. 3. The indicator field 415 (e.g., “RW” field) can store an indicator (e.g., a single bit) to indicate whether the instruction corresponding to the entry is a read instruction or a write instruction.

In some implementations, the write log 420 can be a table with entries that can store addresses and data associated with write instructions. In some implementations, a register can be used to store the write log pointer 430. In particular, the register can store an index of a next free entry in the write log 420.

The data associated with the write instructions can include an old value (e.g., a value of a location before a write instruction is executed), a new value (e.g., a value a write instruction will write to the location), and one or more status bits. For example, if a write instruction is executed, a value (e.g., new value) is written to a memory location. The original value (e.g., value in the memory location before the write instruction was executed) is stored as the old value.

The size of the old values and new values (e.g., number of bits that are used to store the value) can be selected for different architectures. For example, a 32-bit architecture can include old values and new values that are 32-bit values. As another example, a 64-bit architecture can include old values and new values that are 64-bit values.

The status bits can indicate whether the data stored in the old value and new value is valid. The status bits can also be used when deriving the old value of one or more memory locations. The number of status bits can depend on the size of the values (e.g., old values, and new values). For example, each entry in the write log can include one status bit per addressable unit (e.g., a status bit for each 8-bit value, or byte). In some implementations, unused bits or bytes in a log entry are filled with placeholder bits or bytes. For example, placeholder bits or bytes can be the same bit or byte values used to fill the unused bits or bytes in both the old value and the new value.
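A write log entry of this kind can be modeled in software as a minimal sketch. The names WriteLogEntry, WriteLog, log_write, and WORD_BYTES are hypothetical, a 32-bit value width is assumed, memory is modeled as a bytearray, and a zero byte is assumed as the placeholder. Writes wider than WORD_BYTES are not handled by this sketch.

```python
from dataclasses import dataclass
from typing import List

WORD_BYTES = 4  # assumed 32-bit architecture

@dataclass
class WriteLogEntry:
    address: int
    old_value: bytes    # content of the location before the write
    new_value: bytes    # content written by the speculative write
    status: List[bool]  # one status bit per byte: True if the byte is valid

class WriteLog:
    def __init__(self) -> None:
        self.entries: List[WriteLogEntry] = []

    @property
    def pointer(self) -> int:
        """Index of the next free entry (models the write log pointer)."""
        return len(self.entries)

    def log_write(self, memory: bytearray, address: int, data: bytes) -> None:
        """Record old and new values, then perform the write in place.
        Assumes len(data) <= WORD_BYTES."""
        used = len(data)
        pad = b"\x00" * (WORD_BYTES - used)  # same placeholder in both values
        old = bytes(memory[address:address + used]) + pad
        status = [True] * used + [False] * (WORD_BYTES - used)
        self.entries.append(WriteLogEntry(address, old, data + pad, status))
        memory[address:address + used] = data
```

Using the same placeholder byte in both values means an unused byte never appears to have changed, which is consistent with the per-unit roll-back described later.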

FIG. 5 illustrates an example process 500 for generating the structures of FIG. 4. For convenience, the process 500 will be described with respect to a system that performs the process 500.

The system can distinguish between two types of memory instructions: regular read and write instructions (e.g., read and write instructions that are not speculative) and speculative read and write instructions. The system determines 501 if a read or write instruction is speculative. If the read or write instruction is not speculative (“No” branch of step 501), the read or write instruction is executed as usual (e.g., executed as a regular read or write instruction). Then, the system executes 513 a next instruction.

If the read or write instruction is speculative (“Yes” branch of step 501), a speculative read or write instruction is executed. The system locates 502 the speculative read or write instruction in the data structure (e.g., data structure 410 of FIG. 4) for a corresponding processor, processor core, or thread. In some implementations, the data structure can be a hash structure and can use address mapping methods, such as, for example, fully-associative, direct-mapped, or set-associative address mapping methods to locate instructions.

If the execution of the speculative read or write instruction is a first execution of the speculative read or write instruction (e.g., a match is not located in the data structure; “No” branch of step 503), then the system stores 504 the speculative read or write instruction. In particular, the system compares the instruction address of the speculative read or write instruction to the TAG fields of all of the entries in the data structure. If the execution of the speculative read or write instruction is a first execution of the speculative read or write instruction (e.g., the instruction address does not match any of the TAG fields), then the speculative read or write instruction is stored in the data structure. The system can allocate a new entry or use an entry that was evicted earlier.

If the execution of the speculative read or write instruction is not the first execution (e.g., a match is located in the data structure; “Yes” branch of step 503), the system compares an effective address of the speculative read or write instruction (e.g., a memory address that the speculative read or write instruction will access) with a maximum data address and a minimum data address. In particular, the system retrieves 505 values from the MAX field and MIN field in the data structure. If the effective address (e.g., ADDR of FIG. 5) is greater than the maximum data address (“Yes” branch of step 506), then the system stores 507 the effective address as the maximum data address. Otherwise (“No” branch of step 506), if the effective address is less than the minimum data address (“Yes” branch of step 508), then the system stores 509 the effective address as the minimum data address.

The system determines whether or not the speculative read or write instruction is a speculative write instruction. If the speculative read or write instruction is not a speculative write instruction (“No” branch of step 510), then the system performs step 513. If the speculative read or write instruction is a speculative write instruction (“Yes” branch of step 510), then the system sets 511 an indicator (e.g., a bit in the “RW” field) to identify a speculative write instruction. In addition, the system stores 512 the write, or the effective address and the data associated with the write instruction, in the write log (e.g., write log 420). In particular, the effective address and the data associated with the write instruction can be stored in a write log entry pointed to by the write log pointer (e.g., write log pointer 430 of FIG. 4), and the write log pointer is incremented to point to a next free write log entry (e.g., by incrementing the value in the register). Then, the system performs step 513.
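The flow of steps 502 through 512 can be modeled in software as a minimal sketch. The class and method names are hypothetical, the table is modeled as a dictionary keyed by instruction address (TAG), and memory is modeled as a dictionary from address to value. Initializing both MAX and MIN to the effective address on a first execution is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Tuple

class AccessTable:
    """Software model of the access matrix table and write log of one processor."""

    def __init__(self) -> None:
        self.entries: Dict[int, dict] = {}                # TAG -> entry fields
        self.write_log: List[Tuple[int, int, int]] = []   # (address, old, new)

    def speculative_access(self, tag: int, addr: int, memory: Dict[int, int],
                           value: Optional[int] = None) -> Optional[int]:
        entry = self.entries.get(tag)          # steps 502-503: locate by TAG
        if entry is None:                      # step 504: first execution
            entry = {"V": True, "MAX": addr, "MIN": addr, "RW": False}
            self.entries[tag] = entry
        elif addr > entry["MAX"]:              # steps 506-507
            entry["MAX"] = addr
        elif addr < entry["MIN"]:              # steps 508-509
            entry["MIN"] = addr
        if value is None:                      # step 510: speculative read
            return memory.get(addr, 0)
        entry["RW"] = True                     # step 511: mark as write
        self.write_log.append((addr, memory.get(addr, 0), value))  # step 512
        memory[addr] = value
        return None
```

In this model the write log pointer corresponds to len(write_log), since appending an entry advances the index of the next free entry.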

Referring to FIG. 6, the system compares a first precise range of locations with a second precise range of locations to determine whether a range of locations accessed by one processor, processor core, or thread overlaps with the ranges of locations accessed by other processors, processor cores, or threads. The system determines if there are entries in a first data structure that have not been compared. If there are no more entries (“No” branch of step 602), then the system stops 610. If there are more entries (“Yes” branch of step 602), then the system retrieves 603 a maximum data address (e.g., MAX), a minimum data address (e.g., MIN), and an indicator (e.g., RW) from the data structure for a first processor, processor core, or thread.

In addition, the system determines if there are entries in a second data structure that have not been compared to the entry determined in step 602. If there are no more entries in the second data structure (“No” branch of step 604), the system returns to step 602. If there are more entries (“Yes” branch of step 604), the system retrieves 605 a maximum data address (e.g., N.MAX), a minimum data address (e.g., N.MIN), and an indicator (e.g., N.RW) in the data structure for a second processor, processor core, or thread. The system compares the maximum data address and the minimum data address from the first processor, processor core, or thread with the maximum data address and minimum data address from the second processor, processor core, or thread.

In particular, if the indicator in the data structure for the first processor, processor core, or thread is set, or the indicator in the data structure for a second processor, processor core, or thread is set (“Yes” branch of step 606); then a speculative write instruction has accessed the first precise range of locations corresponding to a first program segment, or the second precise range of locations corresponding to a second program segment, respectively. If the indicators are not set (“No” branch of step 606), then the system returns to step 604.

If the maximum data address from the first processor is less than the maximum data address from the second processor, processor core, or thread; and the maximum data address from the first processor is greater than the minimum data address from the second processor (“Yes” branch of step 607); then the first precise range of locations overlaps with the second precise range of locations, and the system identifies 609 a miss-speculation. Otherwise (“No” branch of step 607), the system performs step 608.

If the minimum data address from the first processor is greater than the minimum data address from the second processor, processor core, or thread; and the minimum data address from the first processor is less than the maximum data address from the second processor (“Yes” branch of step 608); then the first precise range of locations overlaps with the second precise range of locations, and the system identifies 609 a miss-speculation. Otherwise (“No” branch of step 608), the system returns to step 604. The system compares each entry for a processor, processor core, or thread with each entry in all of the other processors, processor cores, or threads, in this manner.
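The comparison of steps 602 through 609 can be modeled as a minimal sketch that follows the stated conditions literally. The function names are hypothetical, and each table is assumed to be a list of (MIN, MAX, RW) tuples for one processor, processor core, or thread.

```python
from typing import List, Tuple

Entry = Tuple[int, int, bool]  # (MIN, MAX, RW)

def compare_tables(first: List[Entry], second: List[Entry]) -> bool:
    """Return True if a miss-speculation is identified (step 609)."""
    for mn, mx, rw in first:                 # steps 602-603
        for n_mn, n_mx, n_rw in second:      # steps 604-605
            if not (rw or n_rw):             # step 606: no write involved
                continue
            if n_mn < mx < n_mx:             # step 607: MAX inside other range
                return True
            if n_mn < mn < n_mx:             # step 608: MIN inside other range
                return True
    return False

def miss_speculation(tables: List[List[Entry]]) -> bool:
    """Compare each table with every other table, in both directions."""
    return any(compare_tables(a, b)
               for i, a in enumerate(tables)
               for j, b in enumerate(tables) if i != j)
```

Two properties of this sketch are worth noting. When the first range fully contains the second, neither end point of the first range falls inside the second, so the overlap is detected only when the comparison is repeated in the other direction, which the all-pairs comparison provides. Also, because the stated comparisons are strict, ranges that are identical or that touch only at an end point are not flagged by this sketch; a non-strict comparison would be one way to cover those cases.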

If a miss-speculation is identified, the system restores memory content of locations speculative write instructions have accessed. FIG. 7 shows an example process 700 for a roll-back operation using the data structures of FIG. 4 and a back-track stack. In particular, FIG. 7 shows an example process for restoring memory content of a single location. The system can use the process 700 to restore memory content of all locations speculative write instructions have accessed. In some implementations, the process 700 is applied to one addressable unit (e.g., a byte) at a time. If the width of the values in the write log is larger than the addressable unit, the process 700 is applied for each addressable unit in the log entry (e.g., one unit at a time).

The system determines a processor, processor core, or thread that wrote a final value to a location in memory. The system stores a value of the location and an address of the location. For example, the system can store 720 the value of the location in an “Actual” register and the address of the location (e.g., current address) in a “Current address” register. The value in the Actual register is compared 730 with all of the write log entries in all of the write logs. If the current address is contained in a write log entry, a new value corresponding to the current address is equal to the value of the location (e.g., value in the Actual register), and the new value is not equal to a corresponding old value, then the system determines (“Yes” branch of step 740) a matching entry. The system stores the matching entry in a data structure to book-keep the matching entries (e.g., back-track stack 800 of FIG. 8).

Referring to FIG. 8, the back-track stack 800 can include a number of entries. In some implementations, the number of entries can be less than or equal to a total number of entries in all of the write logs in the system. The back-track stack can be coupled to an interconnect (e.g., interconnect 140 of FIG. 1) and can be accessed by one or more processors, processor cores, or threads in the system. Also, each entry in the back-track stack 800 can include a current “Actual” field 810 and a “Path” field 820. The value in the Path field 820 can identify a particular processor, processor core, or thread. Each entry in the back-track stack can also include a number of fields 830 equal to the number of processors, processor cores, or threads. Each field can store an entry number that identifies the write log entry corresponding to the processor, processor core, or thread for the matching entries. In some implementations, the back-track stack can be a pointer structure. A pointer 840 keeps track of a next free entry in the back-track stack 800.

Returning to FIG. 7, matching entries are stored 780 in a back-track stack (e.g., back-track stack 800). The value in the Actual register is stored in the Actual field, and write log entry numbers for all matching entries are stored in a corresponding field of the processor, processor core, or thread of a matching entry.

An identifier of the particular processor, processor core, or thread (e.g., processor number one) is stored 790 in the Path field. The matching write log entry for the particular processor, processor core, or thread can be called the current entry. In some implementations, a checked bit for the current address in the current entry is marked 790.

The system determines a next value. The next value is an old value stored in the current entry in the write log (e.g., Actual.old). The next value is stored 795 in the Actual register. If there are matching entries remaining (e.g., the system determines another matching entry; “Yes” branch of step 740), the system returns to step 780.

If there is not another matching entry (“No” branch of step 740), the system determines if all entries have been checked. In particular, the system checks whether all write log entries containing the current address have the checked bit set, or if the old values and new values for the current address are equal. If all entries have been checked (“Yes” branch of step 750), then the system restores 760 a memory value and terminates 710. In particular, the system stores the Actual value in the location, thereby restoring the original value of the location.

If all entries have not been checked (e.g., determining that a write log entry containing the current address does not have a checked bit set, and the old values and new values for the current address are not equal; “No” branch of step 750), the system returns, or back-tracks, 770 to the previous entry in the back-track stack that included multiple matching entries (e.g., more than one of the processor, processor core, or thread fields are non-empty). The last entry used is pointed to by a pointer (e.g., pointer 840 of FIG. 8). The system clears the checked bit associated with the current address in the entry stored in the processor, processor core, or thread number field of the processor number in the Path field; the processor, processor core, or thread number field; and the Path field. If there is another non-empty processor number field in the entry, the entry in this field becomes the current entry. The checked bit in the current entry is marked 790, and the corresponding processor, processor core, or thread number is stored in the Path field. If there are no more matching entries in the processor, processor core, or thread number fields, the system returns to the previous entry in the back-track stack (e.g., the system returns to step 770).
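The roll-back of a single location can be illustrated with a simplified sketch. It assumes each write log is a list of (address, old value, new value) tuples, and it uses recursion in place of the explicit back-track stack, Path field, and checked bits, so it models the search and not the hardware structures. The function names are hypothetical.

```python
from typing import FrozenSet, List, Optional, Tuple

LogEntry = Tuple[int, int, int]  # (address, old value, new value)

def restore_value(actual: int, logs: List[List[LogEntry]],
                  address: int) -> Optional[int]:
    """Derive the original value of one location from all write logs.
    Returns None if no consistent chain of writes is found."""
    # Entries that contain the address and actually changed its value.
    candidates = [(old, new)
                  for log in logs
                  for (addr, old, new) in log
                  if addr == address and old != new]

    def back_track(value: int, checked: FrozenSet[int]) -> Optional[int]:
        if len(checked) == len(candidates):      # all entries checked
            return value                         # value to restore
        for k, (old, new) in enumerate(candidates):
            if k not in checked and new == value:    # matching entry
                restored = back_track(old, checked | {k})
                if restored is not None:
                    return restored
        return None                              # dead end: back-track

    return back_track(actual, frozenset())
```

For example, if one processor logged the writes 1 to 2 and 3 to 2 for a location, another logged 2 to 3 and 2 to 4, and the location now holds 4, the sketch first follows the entry with old value 1, reaches a dead end with entries still unchecked, back-tracks, and then derives the original value 1 through the remaining entries.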

Example Software Implementation for Speculative Throughput Computing

The process for speculative throughput computing can be implemented insoftware. For example, the access matrix table can be implemented as adata structure in virtual memory that can be accessed by regular readand write instructions. The speculative read and write instructions canbe emulated using a sequence of regular computer instructions thataccess the data structure in virtual memory (e.g., global variables, andpointer structures).

FIG. 9 illustrates data structures 900 in virtual memory for an examplesoftware implementation of the process for speculative throughputcomputing. For example, FIG. 9 illustrates a matrix with rows that arepointed to by pointer variables. For each pointer variable (e.g., PTR1,PTR2, . . . PTRM; where M>0), a row can be allocated in the datastructure 900 at the time a computer program (e.g., a computer programwritten in C programming language) is compiled. Each pointer can beassociated with a number of elements equal to a number of programsegments (e.g., P1, P2, . . . PN; where N>0). A “SUM” element can alsobe associated with each pointer.

Each element can include but is not limited to: a maximum data address (e.g., MAX entity), a minimum data address (e.g., MIN entity), and an indicator (e.g., W entity). The MAX entity can store a maximum data address accessed by a corresponding program segment. The MIN entity can store a minimum data address accessed by a corresponding program segment. The W entity can identify whether a location between the maximum data address and the minimum data address has been modified (e.g., accessed by a write instruction) by the corresponding program segment.
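The matrix of FIG. 9 can be pictured with a minimal sketch such as the following. It is written in Python rather than C for brevity, and the names Element and make_access_matrix are hypothetical, chosen only for illustration. The sketch assumes that the SUM element is stored as the last element of each row and that an untouched element holds no address yet; the specification does not state either detail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """One cell of the matrix: the range touched by one program segment."""
    max_addr: Optional[int] = None  # MAX entity (None until the first access; assumption)
    min_addr: Optional[int] = None  # MIN entity (None until the first access; assumption)
    written: bool = False           # W entity


def make_access_matrix(num_pointers: int, num_segments: int) -> List[List[Element]]:
    """Allocate one row per pointer variable (PTR1..PTRM).

    Each row holds one element per program segment (P1..PN) followed by
    a trailing SUM element (placement of SUM is an assumption).
    """
    return [[Element() for _ in range(num_segments + 1)]
            for _ in range(num_pointers)]


matrix = make_access_matrix(num_pointers=3, num_segments=4)
```

Here matrix[p][s] would hold the entities for pointer p and program segment s, and matrix[p][-1] would hold the SUM element of the row.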

The software implementation can also generate a write log and write log pointer in virtual memory that are analogous to the write log 420 and write log pointer 430 of FIG. 4. The write log can be generated using, for example, global variables or a pointer structure.

FIG. 10 shows an example process 1000 for generating the structures of FIG. 9. For convenience, process 1000 will be described with respect to a system that performs the process 1000.

In a software implementation, a speculative read or write instruction can be emulated using regular read or write instructions (e.g., load or store instructions). For example, the speculative read or write instructions can be regular read or write instructions that are augmented with a sequence of ordinary instructions (e.g., checking code). The checking code can be used to determine 1020 if a read or write instruction is speculative. If the read or write instruction is not speculative (“No” branch of step 1020), then the read or write instruction is executed as usual (e.g., executed as a regular read or write instruction) and the system executes 1010 a next instruction.

If the read or write instruction is speculative (“Yes” branch of step 1020), then the system compares an effective address (e.g., ADDR of FIG. 10) of the speculative read or write instruction with a maximum data address and minimum data address. For example, the system retrieves 1030 MAX, MIN, and W entities from a data structure using the address for the pointer and the program segment number. If the effective address is greater than the maximum data address (“Yes” branch of step 1040), then the system stores 1050 the effective address as the maximum data address. Otherwise (“No” branch of step 1040), if the effective address is less than the minimum data address (“Yes” branch of step 1055), then the system stores 1060 the effective address as the minimum data address.

The system determines whether or not the speculative read or write instruction is a speculative write instruction. If the speculative read or write instruction is not a speculative write instruction (“No” branch of step 1070), then the system executes 1010 a next instruction. If the speculative read or write instruction is a speculative write instruction (“Yes” branch of step 1070), then the system sets 1080 an indicator (e.g., a “W” bit) to identify a speculative write instruction. In addition, the system stores 1090 the write, or the effective address and the data associated with the write instruction, in a write log (e.g., a write log analogous to write log 420, but implemented in virtual memory). In particular, the effective address and the data associated with the write instruction can be stored in a write log entry pointed to by a write log pointer, and the write log pointer is incremented to point to a next free write log entry. Then, the system returns to step 1010.
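The checking code of process 1000 can be sketched as follows. This is an illustrative Python sketch, not the specification's implementation: the names Element, write_log, and speculative_access are hypothetical, the write log is modeled as a list whose append position plays the role of the write log pointer, and the handling of the very first access to an element (both MAX and MIN set to the address) is an assumption that the specification does not spell out.

```python
class Element:
    """MAX, MIN, and W entities for one pointer and one program segment."""

    def __init__(self):
        self.max_addr = None   # MAX entity
        self.min_addr = None   # MIN entity
        self.written = False   # W entity


# Write log in virtual memory; appending stands in for advancing the write log pointer.
write_log = []


def speculative_access(element, addr, is_write, data=None):
    """Emulate one speculative read or write (steps 1030 to 1090)."""
    if element.max_addr is None:
        # First access to this element (assumption: initialize both bounds).
        element.max_addr = addr
        element.min_addr = addr
    elif addr > element.max_addr:      # step 1040
        element.max_addr = addr        # step 1050
    elif addr < element.min_addr:      # step 1055
        element.min_addr = addr        # step 1060

    if is_write:                       # step 1070
        element.written = True         # step 1080
        write_log.append((addr, data)) # step 1090
```

A speculative read of address 100 followed by a speculative write of the value 7 to address 140 would leave the element with MIN 100, MAX 140, and the W indicator set, and would add one entry to the write log.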

Referring to FIG. 11, the system compares a first precise range of locations with a second precise range of locations to determine whether a range of locations accessed by a program segment overlaps with the ranges of locations accessed by other program segments. The system selects 1100 a first pointer (e.g., a first pointer row, pointed to by PTR1, in the matrix of FIG. 9). The system selects 1110 a first element for the first pointer, and retrieves the MAX, MIN, and W entities corresponding to the first element. The system then selects 1115 a next element for the first pointer, and retrieves the MAX, MIN, and W entities corresponding to the next element. If the first pointer or the next element does not exist, then the process terminates. The system compares the maximum data address (e.g., MAX entity) and the minimum data address (e.g., MIN entity) from the first element with the maximum data address (e.g., MAX entity) and minimum data address (e.g., MIN entity) from the next element.

The system determines if there is a harmful overlap. In particular, if the indicator in the first element is set, or the indicator in the next element is set; then a speculative write instruction has accessed the first precise range of locations corresponding to the first program segment, or the second precise range of locations corresponding to the second program segment, respectively. If the maximum data address from the first element is less than the maximum data address from the next element; and the maximum data address from the first element is greater than the minimum data address from the next element; then the first precise range of locations overlaps with the second precise range of locations. The system determines (“Yes” branch of step 1120) a harmful overlap, and the system identifies 1140 a miss-speculation. If the minimum data address from the first element is greater than the minimum data address from the next element; and the minimum data address from the first element is less than the maximum data address from the next element; then the first precise range of locations overlaps with the second precise range of locations. The system determines a harmful overlap (“Yes” branch of step 1120), and the system identifies 1140 a miss-speculation.

If the system does not determine a harmful overlap (“No” branch of step 1120), the system determines if there are more entries for the pointer. The system compares each element associated with a pointer with all of the other elements associated with the pointer, in this manner. For example, assume that a row of a current pointer includes elements A, B, and C. The system compares pairs A with B, A with C, and B with C. If there are more entries for the pointer (“Yes” branch of step 1130), then the system returns to step 1115.
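The pairwise check of steps 1115 to 1130 can be sketched as follows. The function names are hypothetical, and each element is represented as a (MIN, MAX, W) tuple for brevity. The sketch reads the description as requiring at least one W indicator to be set before an overlap counts as harmful, and it applies only the two strict comparisons given in the text; as written, those comparisons do not cover a range that fully contains another range or that shares an endpoint with it, so a practical implementation might substitute a general interval intersection test.

```python
from itertools import combinations


def harmful_overlap(first, nxt):
    """Return True if two (min_addr, max_addr, written) elements conflict."""
    f_min, f_max, f_w = first
    n_min, n_max, n_w = nxt
    if not (f_w or n_w):
        # Neither segment wrote inside its range (interpretation of "harmful").
        return False
    # MAX of the first element lies strictly inside the next element's range.
    max_inside = n_min < f_max < n_max
    # MIN of the first element lies strictly inside the next element's range.
    min_inside = n_min < f_min < n_max
    return max_inside or min_inside


def row_has_miss_speculation(row):
    """Compare every pair in a row: A with B, A with C, B with C, and so on."""
    return any(harmful_overlap(a, b) for a, b in combinations(row, 2))
```

For a row with elements A = (0, 10, unwritten), B = (5, 15, written), and C = (20, 30, unwritten), the pair A and B overlaps and B carries a write, so a miss-speculation would be identified.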

If there are not more entries for the pointer (“No” branch of step 1130), then the system determines and stores 1135 the values (e.g., MAX entity, MIN entity, and W entity) in the SUM element for the row. In particular, the system determines the maximum of all the MAX entities (e.g., highest maximum data address) and the minimum of all the MIN entities (e.g., lowest minimum data address) in the row. For example, assume that a row includes elements A, B, and C. The determined maximum is the highest MAX value of A, B, and C; and the determined minimum is the lowest MIN value of A, B, and C. The determined maximum and minimum values are stored in the MAX and MIN entities of the SUM element of the row.

The system computes the W entity as the logical OR operation of the W entities of all the elements in the row. For example, assume that a row includes elements A, B, and C. The W entity of the SUM element is computed using the expression:

W(A) OR W(B) OR W(C), where

W(x) represents the W entity of element x.
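The computation of the SUM element in step 1135 can be sketched as follows. The function name summarize_row is hypothetical, and each element is again represented as a (MIN, MAX, W) tuple purely for illustration.

```python
def summarize_row(row):
    """Compute the SUM element of a row of (min_addr, max_addr, written) tuples."""
    lowest_min = min(element[0] for element in row)    # minimum of all MIN entities
    highest_max = max(element[1] for element in row)   # maximum of all MAX entities
    any_write = any(element[2] for element in row)     # W(A) OR W(B) OR W(C)
    return (lowest_min, highest_max, any_write)
```

For a row with A = (10, 20, unwritten), B = (5, 12, written), and C = (30, 40, unwritten), the SUM element would hold MIN 5, MAX 40, and a set W indicator.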

If there are more pointers to compare (“Yes” branch of step 1105), then the system returns to step 1110. If there are not more pointers to compare (“No” branch of step 1105), then the system compares the SUM elements. In particular, the system selects 1145 a first SUM element, and retrieves MAX, MIN, and W entities corresponding to the first SUM element. The system selects 1150 a next SUM element, and retrieves MAX, MIN, and W entities corresponding to the next SUM element. The system uses these entities to compare a first precise range of locations with a second precise range of locations to determine whether a range of locations accessed by a processor, processor core, or thread overlaps with the ranges of locations accessed by other processors, processor cores, or threads.

The system determines miss-speculations using a process (e.g., steps 1155 and 1165) analogous to the process described previously with reference to steps 1120 and 1130. If there are no more entries for SUM (“No” branch of step 1065), then the system stops. In particular, if the system does not identify a harmful overlap (“No” branch of step 1155), the system determines if the system has compared all SUM elements. If the system has compared all SUM elements (“Yes” branch of step 1165), the system stops 1170. Otherwise (“No” branch of step 1165), the system returns to step 1150. The SUM elements are compared in a manner analogous to the comparison of the row elements described previously.

If a miss-speculation is identified, the system restores the memory content of locations that speculative write instructions have accessed in a manner analogous to that described above with reference to FIG. 7. In some implementations, however, the system uses computer instructions instead of data structures in hardware. For example, pointer structures can be used to implement the data structures, write log, and write log pointer. The system uses the write log and a back-track stack to restore the memory content of locations that speculative write instructions have accessed.

Reducing Miss-Speculations

The example hardware and software implementations of the process for speculative throughput computing, described previously, illustrated processes to identify miss-speculations. In some implementations, miss-speculations can also be reduced in the process for speculative throughput computing.

FIG. 12 illustrates an example process 1200 for reducing miss-speculations. For convenience, process 1200 will be described with respect to a system that performs the process 1200. The system generates a schedule of program segments for a program to reduce a number of miss-speculations. In particular, the system derives 1210 a data dependence graph. The system determines 1220 an execution order of the program segments from the data dependence graph. The system executes 1230 the program segments according to the execution order. The system compares 1240 the program segment executions to identify dependencies.

FIG. 13 illustrates an example data dependence graph 1300. The data dependence graph can be derived from a computer program. For example, assume that a computer program includes the following code:

while condition do
 for i=1 to 4 do
  for j=1 to 4 do
  begin
    A[i,j] := A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1];
  end;

In each iteration of the loop, a matrix element with index variables i and j (e.g., A[i,j]) is the sum of four neighbor matrix elements (e.g., A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1]). If the iterations are enumerated in the order that they are executed (e.g., a first iteration is represented by state 1, and a second iteration by state 2), the resulting dependencies between iterations are derived, as illustrated in FIG. 13. For example, in the first enumeration (e.g., represented by state 1), i=1 and j=1, so A[1,1] is derived. As another example, in the second enumeration (e.g., represented by state 2), i=1 and j=2, so A[1,2] is derived. As a further example, in the fifth enumeration (e.g., represented by state 5), i=2 and j=1, so A[2,1] is derived. The fifth enumeration derives A[2,1]=A[1,1]+A[3,1]+A[2,0]+A[2,2]. The fifth enumeration does not depend on the second enumeration (A[1,2]), but the fifth enumeration depends on the first enumeration (A[1,1]). The arrows in the state graph illustrate the dependencies.

The program segments to be executed in parallel can be formed using each of the iterations. Because there are dependencies between almost all of the consecutive iterations (e.g., 4 depends on 3, 2, and 1), uncovering parallelism can be difficult. The data dependence graph allows a system to uncover parallelism, for example, in the loop. For example, program segments with enumeration order 4, 7, 10, and 13 can be executed in parallel.

In order to derive the data dependence graph, the system augments the program with trigger points that delimit program segments. In some implementations, the trigger points increment a variable that enumerates the enumeration order of the program segment. For example, the program loop (code illustrated above) can be augmented with a trigger point (e.g., “trigger;”) as in the following example code:

while condition do
 for i=1 to 4 do
  for j=1 to 4 do
  begin
    trigger;
    A[i,j] := A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1];
  end;

Each iteration in the program loop includes a call to the trigger point. The trigger point can increment a variable that enumerates the enumeration order of the program segment. The system also augments the program with markings of read and write instructions that potentially cause dependencies. The markings identify the instructions as speculative instructions.

If the program executes sequentially, a trap is generated each time a speculative read or write instruction is executed. A software routine (e.g., a trap handler) records the address and type (e.g., read or write) of speculative instruction in a file. The enumeration order of the program segment is also recorded in the file. Post-processing of the file is used to derive 1210 the data dependence graph, and the system determines 1220 an execution order of the program segments from the data dependence graph.
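The post-processing step can be sketched as follows. The specification does not define the file format or the exact notion of dependence, so this Python sketch makes two assumptions: each record is a (segment, address, kind) tuple listed in sequential execution order, with kind being "R" or "W", and only read-after-write dependencies are recorded as edges. The function name dependence_edges is hypothetical.

```python
def dependence_edges(trace):
    """Derive (earlier_segment, later_segment) edges from a recorded trace.

    trace: iterable of (segment, address, kind) records in sequential
    execution order, where kind is "R" for a read or "W" for a write.
    """
    last_writer = {}   # address -> enumeration order of the last segment that wrote it
    edges = set()
    for segment, address, kind in trace:
        if kind == "R":
            writer = last_writer.get(address)
            if writer is not None and writer != segment:
                # The segment reads a value produced by an earlier segment.
                edges.add((writer, segment))
        else:
            last_writer[address] = segment
    return edges
```

Applied to the example loop, a trace in which segment 1 writes A[1,1] and segment 5 later reads A[1,1] would yield the edge (1, 5), while no edge would be produced between segments 2 and 5.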

The system executes 1230 the program segments according to the execution order. In particular, the system can use a programming construct to execute program segments in parallel. For example, a construct “parallel_for (i,j)=(x1,y1; x2,y2; . . . ; xn,yn)” can be interpreted so that program segments with index variables (x1,y1), (x2,y2), . . . , and (xn,yn) execute speculatively in parallel. As a further example, an example program loop that uses the construct includes the code:

while condition do
begin
 parallel_for (i,j) = (1,1)
   A[i,j] := A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1];
 parallel_for (i,j) = (2,1;1,2)
   A[i,j] := A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1];
 parallel_for (i,j) = (3,1;2,2;1,3)
   A[i,j] := A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1];
 parallel_for (i,j) = (4,1;3,2;2,3;1,4)
   A[i,j] := A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1];
 parallel_for (i,j) = (4,2;3,3;2,4)
   A[i,j] := A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1];
 parallel_for (i,j) = (4,3;3,4)
   A[i,j] := A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1];
 parallel_for (i,j) = (4,4)
   A[i,j] := A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1];
 commit;
end;

The program loop includes seven “parallel_for” constructs. The system uses the construct to execute 1230 the program segments according to the execution order. For example, the second “parallel_for” construct allows the system to execute a set of program segments (2,1) and (1,2) in parallel. In contrast, consecutive “parallel_for” constructs are executed sequentially. For example, the third “parallel_for” construct allows the system to execute program segments (3,1), (2,2) and (1,3) in parallel, after the system executes program segments (2,1) and (1,2) in parallel.
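The seven groups in the example loop coincide with the sets of index pairs that share the same value of i+j. The following Python sketch reproduces that grouping under this observation, which is an inference from the example rather than a rule stated in the specification. The function names are hypothetical, and the enumeration order is assumed to follow the original nested loop, with j varying fastest.

```python
def enumeration_order(i, j, n=4):
    """Enumeration order of iteration (i, j) in the original sequential loop."""
    return (i - 1) * n + j


def wavefronts(n=4):
    """Group iterations by i + j; each group is one candidate parallel set."""
    groups = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            groups.setdefault(i + j, []).append(enumeration_order(i, j, n))
    # Groups are returned in the order in which they would be executed.
    return [groups[key] for key in sorted(groups)]


schedule = wavefronts(4)
```

Under these assumptions the schedule contains seven groups, and the fourth group holds the program segments with enumeration order 4, 7, 10, and 13.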

The system compares 1240 the program segment executions to identify dependencies.

Determining whether or not a program segment in a later “parallel_for” construct with a higher enumeration order causes a dependency with a program segment in an earlier “parallel_for” construct with a lower enumeration order may be difficult. In some implementations, the system compares 1240 the program segment executions to identify dependencies, after all of the program segments have been executed (e.g., at a “commit” point in the program).

Program Translation and Speculative Execution

Returning to FIG. 2, the system analyzes 210 data dependence of program segments of a program, and the system profiles 220 the program. In particular, the system collects statistics for use in selecting a speculative method (e.g., a process for speculative throughput computing). In some implementations, the statistics include a number of instructions executed in program segment i (N_(i)), where i is an integer greater than zero; an average number of cycles used by an instruction (CPI_(i)); a number of speculative read instructions in program segment i (R_(i)); and a number of speculative write instructions in program segment i (W_(i)).

The system predicts a speedup gain from speculative execution. In particular, the system determines an execution time of a sequential execution of the program (T_(sequential)). The system derives the speedup gain by applying an analytical model. For example, the analytical model can include execution times represented by an equation:

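$$T_{\text{exec}}(K) = \operatorname{Max}\left[\, T_{\text{segment-start}}(K) + N_i \cdot CPI_i + R_i \cdot R_{\text{cost}}(K) + W_i \cdot W_{\text{cost}}(K) + P_{\text{roll-back}} \cdot \text{Roll-back}_{\text{cost}}(K) + (1 - P_{\text{roll-back}}) \cdot C_{\text{cost}}(K) \,\right]$$, where: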

K identifies one of a plurality of speculative methods; Max is a maximum function; T_(segment-start) models a start-up cost of initiating a speculative thread; R_(cost) models a cost of a speculative read instruction; W_(cost) models a cost of a speculative write instruction; P_(roll-back) is a probability of a miss-speculation; Roll-back_(cost) models a cost of a roll-back; and C_(cost) models a cost of a commit.

The expression inside the maximum function (Max) is an estimated execution time for program segment i. The maximum function determines an execution time of a slowest program segment for a speculative method K. Therefore, the analytical model includes execution times of a plurality of speculative methods, and the execution times of the plurality of speculative methods equal the execution times of a slowest program segment. Applying T_(exec)(K) from the analytical model, the system can predict a speedup gain using an equation:

$$\text{Speedup}(K) = \frac{T_{\text{sequential}}}{T_{\text{exec}}(K)}.$$

The system then selects 230 one or more of a plurality of speculative methods (e.g., processes described with reference to FIGS. 3-13) to use for translation using the speedup gains for each speculative method. For example, the system selects the speculative method with the highest speedup gain. In some implementations, the system selects sequential execution if sequential execution has a lower execution time than the speculative methods (e.g., T_(sequential)<T_(exec)(K), for all K).
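The analytical model and the selection step can be illustrated with a short Python sketch. The class and function names (SegmentStats, MethodCosts, t_exec, speedup, select_method) are hypothetical, and the sketch assumes that the maximum is taken over the program segments i, consistent with the description of the slowest program segment.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    n: int        # N_i: instructions executed in program segment i
    cpi: float    # CPI_i: average cycles per instruction
    r: int        # R_i: speculative read instructions
    w: int        # W_i: speculative write instructions


@dataclass
class MethodCosts:
    t_segment_start: float   # start-up cost of a speculative thread
    r_cost: float            # cost of a speculative read
    w_cost: float            # cost of a speculative write
    p_roll_back: float       # probability of a miss-speculation
    roll_back_cost: float    # cost of a roll-back
    c_cost: float            # cost of a commit


def t_exec(segments, m):
    """Estimated execution time of the slowest program segment for method m."""
    return max(
        m.t_segment_start + s.n * s.cpi + s.r * m.r_cost + s.w * m.w_cost
        + m.p_roll_back * m.roll_back_cost + (1 - m.p_roll_back) * m.c_cost
        for s in segments
    )


def speedup(t_sequential, segments, m):
    return t_sequential / t_exec(segments, m)


def select_method(t_sequential, segments, methods):
    """Pick the method with the highest speedup, or sequential execution."""
    best = max(methods, key=lambda k: speedup(t_sequential, segments, methods[k]))
    if t_sequential < t_exec(segments, methods[best]):
        return "sequential"   # sequential beats every speculative method
    return best
```

Because the method with the highest speedup also has the lowest T_(exec)(K), comparing T_(sequential) against that single method is equivalent to comparing it against all K.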

In some implementations, the system selects a combination of speculative methods. For example, the system can use a combination of parts of one or more speculative methods of the plurality of speculative methods. As another example, the system can use a combination of one or more speculative methods of the plurality of speculative methods executed sequentially. As yet another example, the system can use a combination of one or more speculative methods of the plurality of speculative methods executed in parallel.

In some implementations, the process 200 can be included in a module that can be integrated with a compiler. In some implementations, a combination of one or more steps of process 200 (e.g., combinations of steps 210, 220, and 230) can be included in a module. In some implementations, the modules can be integrated with compilers. For example, a first compiler can include a module that can analyze data dependence of the program (e.g., step 210). A second compiler can include a module that can perform the other steps of process 200 (e.g., steps 220 and 230). In some implementations, an apparatus (e.g., multiprocessor computing system 100, multicore computing system, or multi-threaded computing system) can perform speculative execution (e.g., according to some or all of the processes described with reference to FIGS. 3-13). In some implementations, a system that includes the apparatus and the compiler can perform speculative execution (e.g., according to some or all of the methods described with reference to FIGS. 3-13).

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, or to control the operation of, data processing apparatus. The tangible program carrier can be a propagated signal or a computer-readable medium. The propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a computer. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.

The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter described in this specification have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

1. A method comprising: translating a program to execute on a plurality of processors, processor cores, or threads including: analyzing data dependence of program segments of the program; collecting statistics that include at least one of: a number of instructions executed in a program segment, an average number of cycles used by an instruction, and a number of speculative reads and writes in a program segment; and predicting a speedup gain from speculative execution including: determining an execution time of a sequential execution of the program, and deriving the speedup gain by applying an analytical model.
2. The method of claim 1, where the analytical model includes execution times of a plurality of speculative methods, and the execution times of the plurality of speculative methods equal the execution times of a slowest program segment.
3. The method of claim 2, where translating the program to execute on a plurality of processors, processor cores, or threads further comprises: selecting a selected speculative method from the plurality of speculative methods using the speedup gain.
4. The method of claim 3, further comprising: executing the program using the selected speculative method.
5. The method of claim 2, where the execution times are represented by the equation T_(exec)(K) = Max[T_(segment-start)(K) + N_(i)·CPI_(i) + R_(i)·R_(cost)(K) + W_(i)·W_(cost)(K) + P_(roll-back)·Roll-back_(cost)(K) + (1−P_(roll-back))·C_(cost)(K)], where: K identifies one of a plurality of speculative methods; Max is a maximum function; T_(segment-start) models a start-up cost of initiating a speculative thread; R_(cost) models a cost of a speculative read instruction; W_(cost) models a cost of a speculative write instruction; P_(roll-back) is a probability of a miss-speculation; Roll-back_(cost) models a cost of a roll-back; C_(cost) models a cost of a commit; N_(i) is a number of instructions executed in program segment i; CPI_(i) is an average number of cycles used by an instruction; R_(i) is a number of speculative read instructions in program segment i; and W_(i) is a number of speculative write instructions in program segment i.
6. The method of claim 5, where the speedup gain is represented by the equation Speedup(K) = T_(sequential)/T_(exec)(K), and T_(sequential) is the execution time of the sequential execution of the program.
7. The method of claim 3, where the plurality of speculative methods comprises speculative methods that include: a combination of parts of one or more speculative methods of the plurality of speculative methods.
8. The method of claim 3, where the plurality of speculative methods comprises speculative methods that include: a combination of one or more speculative methods of the plurality of speculative methods executed sequentially.
9. The method of claim 3, where the plurality of speculative methods comprises speculative methods that include: a combination of one or more speculative methods of the plurality of speculative methods executed in parallel.
10. A computer program product, encoded on a computer-readable medium, operable to cause a data processing apparatus to: translate a program to execute on a plurality of processors, processor cores, or threads including: analyzing data dependence of program segments of the program; collecting statistics that include at least one of: a number of instructions executed in a program segment, an average number of cycles used by an instruction, and a number of speculative reads and writes in a program segment; and predicting a speedup gain from speculative execution including: determining an execution time of a sequential execution of the program, and deriving the speedup gain by applying an analytical model.
11. The computer program product of claim 10, where the analytical model includes execution times of a plurality of speculative methods, and the execution times of the plurality of speculative methods equal the execution times of a slowest program segment.
12. The computer program product of claim 11, where translating the program to execute on a plurality of processors, processor cores, or threads further comprises: selecting a selected speculative method from the plurality of speculative methods using the speedup gain.
13. The computer program product of claim 12, further comprising: executing the program using the selected speculative method.
14. A system comprising: one or more processors or processor cores; a computer-readable medium coupled to the one or more processors or processor cores and having instructions contained thereon, which, when executed by the one or more processors or processor cores, causes the one or more processors or processor cores to perform the operations of: translating a program to execute on a plurality of processors, processor cores, or threads including: analyzing data dependence of program segments of the program; collecting statistics that include at least one of: a number of instructions executed in a program segment, an average number of cycles used by an instruction, and a number of speculative reads and writes in a program segment; and predicting a speedup gain from speculative execution including: determining an execution time of a sequential execution of the program, and deriving the speedup gain by applying an analytical model.
15. The system of claim 14, where the analytical model includes execution times of a plurality of speculative methods, and the execution times of the plurality of speculative methods equal the execution times of a slowest program segment.
16. The system of claim 15, where translating the program to execute on a plurality of processors, processor cores, or threads further comprises: selecting a selected speculative method from the plurality of speculative methods using the speedup gain.
17. The system of claim 16, further comprising: executing the program using the selected speculative method.